Fifth Lab

Today’s goals. Learn about:

  1. Likelihood theory
  2. Simple linear regression

Workflow

  1. Consider examples related to theoretical concepts
  2. Use R to solve some basic questions and to understand important aspects

Let’s start

  1. Download the R code (it is briefly commented)
  2. You will find XXX where I will ask you to code a bit
  3. Part of the code after each XXX depends on successfully filling in these lines of code

Likelihood theory: Maximum likelihood estimation

Likelihood inference

The likelihood function

Let \(\mathcal{F}\) be a parametric statistical model for the data \(y\), where \(f(y;\theta)\) is the density or probability function and \(\theta \in \Theta\) is a \(p\)-dimensional parameter. By considering \(f(y;\theta)\) as a function of \(\theta\), with \(y\) fixed at the observed value, the likelihood function of \(\theta\) based on \(y\), \(L: \Theta \rightarrow \mathbb{R}^{+}\), is \[L(\theta) = L(\theta; y) \propto f_Y(y; \theta) \,\,, \] where the proportionality involves a constant \(c(y)>0\) that does not depend on \(\theta\).

On the basis of the data \(y\), \(\theta \in \Theta\) is more likely than \(\theta' \in \Theta\) as an index of the data-generating model if \(L(\theta) > L(\theta')\).

Note

  • \(L(\theta)\) is not a density function (on \(\Theta\)).
  • It is convenient to work with \(\ell(\theta) = \log L(\theta)\).
  • If we assume independent observations for \(y=(y_1, \ldots, y_n)\), then \(L(\theta) \propto \prod_{i=1}^{n} f_{Y_i}(y_i; \theta)\) and, up to an additive constant, \(\ell(\theta) = \sum_{i=1}^{n} \log f_{Y_i}(y_i; \theta)\).
  • The value \(\hat \theta\) that maximizes \(L(\theta)\) (or equivalently \(\ell(\theta)\)) is the maximum likelihood estimate.

Binomial model

Maximum likelihood estimation in the Binomial model

Let \(y=(y_{1},\ldots, y_{n})\) be a sample of i.i.d. values from a Bernoulli distribution, \(Y \sim \mbox{Be}(p)\). Then the likelihood function is

\[L(p) = \prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i} \] and the log-likelihood function takes the form \[\ell(p) = \sum_{i=1}^{n} \big[ y_i \log (p) + (1-y_i) \log (1-p) \big]\] The maximum likelihood estimate, \(\hat p\), is the sample proportion \(\hat p =\frac{1}{n}\sum_{i=1}^{n}y_i\). To derive it, we obtain the score function \[\ell_{\star}(p)=U(p)=\frac{\partial \ell(p)}{\partial p}= \sum_{i=1}^{n} \bigg[ \frac{y_i}{p} - \frac{1-y_i}{1-p} \bigg] = \frac{\sum_{i=1}^{n}y_i}{p} - \frac{n- \sum_{i=1}^{n}y_i}{1-p}\] By setting the score function equal to zero, we obtain the maximum likelihood estimate

\[U(p) = 0 \implies \hat p = \frac{1}{n}\sum_{i=1}^{n} y_i \]
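As a quick numerical check of this derivation, the root of the score function can also be found with uniroot(); a minimal sketch with an illustrative simulated sample (uniroot() and the names below are not part of the lab code):

```r
# Sketch: solving U(p) = 0 numerically agrees with the sample proportion
set.seed(7)
y_sim <- rbinom(100, 1, 0.6)                 # illustrative Bernoulli sample
score <- function(p) sum(y_sim) / p - (length(y_sim) - sum(y_sim)) / (1 - p)
# the root of the score is (numerically) mean(y_sim)
uniroot(score, interval = c(1e-6, 1 - 1e-6), tol = 1e-10)$root
mean(y_sim)
```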

Binomial model

Example

Suppose we generate a random sample \(y\) of size \(n=100\) from a Bernoulli distribution with parameter \(p=0.6\), and we want to make inference on \(p\).

We want to write a function that takes the parameter and the data as arguments and returns the log-likelihood. We can do it in two ways:

  • Leveraging the dbinom() function
llik_bin <- function(theta, data){
  sum(dbinom(x = data, size = 1, prob = theta, log = TRUE))
}
  • Writing the log-likelihood function manually
llik_bin2 <- function(theta, data){
  sum(data) * log(theta) + (length(data) - sum(data)) * log(1-theta)
}

Binomial model

Example

We can check that both functions return the same value, for instance by fixing \(p=0.5\)

llik_bin(0.5,y)
[1] -69.31472
llik_bin2(0.5,y)
[1] -69.31472

The maximum likelihood estimate is available in closed form (later we will see how to carry out ML estimation via numerical optimisation, which is particularly useful when the ML estimate cannot be obtained analytically), so

MLEp  <- mean(y)
MLEp
[1] 0.63
llik_bin(MLEp, y)
[1] -65.89557
pgrid <- seq(0.01, 0.99, by = 0.01)

pgrid[which(llik_bin2(pgrid, y) == max(llik_bin2(pgrid, y)))]
[1] 0.63

Binomial model

Example

Note that above we used the function written manually. This is because that function is vectorized in the parameter, while the first one must be vectorized before such a call. Indeed

llik_bin(c(0.5, 0.75, 0.99), y) # Wrong
[1] -98.22045
llik_bin2(c(0.5, 0.75, 0.99), y) # Correct
[1]  -69.31472  -69.41686 -171.02447
# The result above is due to this operation
sum(dbinom(x = y[seq(1,100,3)], size = 1, prob = 0.5, log = TRUE)) + 
sum(dbinom(x = y[seq(2,100,3)], size = 1, prob = 0.75, log = TRUE)) + 
sum(dbinom(x = y[seq(3,100,3)], size = 1, prob = 0.99, log = TRUE))
[1] -98.22045
# However, we can vectorize the function llik_bin by means of 
llik_bin_v <- Vectorize(llik_bin, 'theta')
llik_bin_v(c(0.5, 0.75, 0.99), y) # Correct
[1]  -69.31472  -69.41686 -171.02447

Binomial model

Example

Then we can plot the log-likelihood, including two vertical lines marking where the true parameter value and the maximum likelihood estimate are located

par(mfrow = c(1, 2))
curve(llik_bin2(theta=x, data = y), 0, 1, 
      xlab= expression(p), ylab = expression(l(p)),
      main = "Log-likelihood function")
abline(v = p, col = "red")
abline(v = MLEp, col = "blue")
legend("topleft", legend = c(expression(p), expression(hat(p))),
       col = c("red", "blue"), lty = c(1, 1))
curve(llik_bin_v(theta=x, data = y), 0, 1, 
      xlab= expression(p), ylab = expression(l(p)),
      main = "Log-likelihood function")
abline(v = p, col = "red")
abline(v = MLEp, col = "blue")
legend("topleft", legend = c(expression(p), expression(hat(p))),
       col = c("red", "blue"), lty = c(1,1 ))

Binomial model

Fisher information for the Binomial model

We know that \(\hat \theta \stackrel{\cdot} \sim \mathcal{N}(\theta,\mathcal{I}^{-1}(\theta) )\), where \(\mathcal{I}(\theta)\) is the expected information matrix, that is \(\mathcal{I}(\theta)=E(J(\theta; Y))\), with \(J(\theta; Y)\) the observed information matrix, i.e. the negative Hessian of the log-likelihood. In our one-dimensional parameter example \[J(p) = -\frac{\partial^2 \ell(p)}{\partial p^2} = \frac{\sum_{i=1}^{n}y_i}{p^2} + \frac{n-\sum_{i=1}^{n}y_i}{(1-p)^2}\] Thus, since \(\sum_{i=1}^{n} {Y_i}\sim \mbox{Bin}(n,p)\), we have \[\mathcal{I}(p)=E[J(p; Y)]= \frac{\sum_{i=1}^{n}E[Y_i]}{p^2} + \frac{n-\sum_{i=1}^{n}E[Y_i]}{(1-p)^2} = \frac{n}{ p (1-p)}\] Then \(\hat p \stackrel{\cdot}\sim \mathcal{N}(p, \frac{p(1-p)}{n})\), where we can replace \(\mathcal{I}^{-1}(p)\) with the estimate \(i^{-1}(\hat p)=j^{-1}(\hat p)=\hat p (1- \hat p)/n\).
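As a quick illustration of this asymptotic result, a sketch computing the estimated standard error and a 95% Wald interval for \(p\) (the simulated sample and names below are illustrative, not the lab's y):

```r
# Sketch: standard error and Wald interval based on i^{-1}(p_hat)
set.seed(123)
y_sim <- rbinom(100, 1, 0.6)                          # illustrative sample
p_hat <- mean(y_sim)                                  # MLE of p
se_hat <- sqrt(p_hat * (1 - p_hat) / length(y_sim))   # sqrt of i^{-1}(p_hat)
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se_hat        # 95% Wald interval
round(c(estimate = p_hat, lower = ci[1], upper = ci[2]), 3)
```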

Reparametrizations

Logit reparametrization of the Binomial model

Let us consider the logit function \(\psi(p) = {\rm logit}(p)=\log\bigg(\frac{p}{1-p}\bigg)\), which gives the log-odds. We can use it to reparametrise the model so that the parameter space is unbounded. The parameter is then expressed on the logit scale, and we need to re-express it on the original scale, that is \(p(\psi) = \exp(\psi)/(1+\exp(\psi))\). We can leverage the invariance of the maximum likelihood estimator under reparametrisations, so the maximum likelihood estimate is simply \(\hat \psi = \log\bigg(\frac{\hat p}{1- \hat p}\bigg)\). Hence, we do not need to write down the log-likelihood under the reparametrisation to obtain the MLE of \(\psi\) from it.

Reparametrizations

Logit reparametrization of the Binomial model

However, we can easily visualize the log-likelihood function under the reparametrisation as

psi <- function(theta) log(theta/(1-theta))
theta <- function(psi) exp(psi)/(1+exp(psi))
llik_bin_rep <- function(param, data) llik_bin2(theta(param), data)

curve(llik_bin_rep(param = x, data = y), -10, 10, main = "Log-likelihood function",
      xlab= expression(psi), ylab = expression(l(psi)))
abline(v = psi(p), col = "red")
abline(v = psi(MLEp), col = "blue")
legend("topleft", legend = c(expression(psi), expression(hat(psi))),
       col = c("red", "blue"), lty = c(1, 1))

Reparametrizations

Logit reparametrization of the Binomial model

The asymptotic distribution of the ML estimator of \(\psi\) is given by \(\hat \psi \stackrel{\cdot} \sim \mathcal{N}(\psi, V(\hat \psi))\).

By using the delta method, we can easily obtain

\[ V(\hat \psi) = V(\hat p) \bigg(\frac{d}{d p} \psi (p) \bigg)^2 = V(\hat p) \bigg(\frac{d}{d p} \log\frac{p}{1-p}\bigg)^2 = \frac{ p (1 - p)}{n} \frac{1}{p^2(1-p)^2} = \frac{1}{np(1-p)}\] and a consistent estimate of this variance is \(\hat V(\hat \psi) = \frac{1}{n \hat p (1 - \hat p)}\).
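A quick Monte Carlo check of this delta-method variance can be run directly (a sketch; sample sizes and names below are illustrative):

```r
# Sketch: compare the empirical variance of psi(p_hat) across simulations
# with the delta-method approximation 1 / (n p (1 - p))
set.seed(1)
p <- 0.6; n <- 1000; R <- 5000
psi <- function(t) log(t / (1 - t))             # logit transform
prop <- replicate(R, mean(rbinom(n, 1, p)))     # R simulated MLEs of p
c(empirical = var(psi(prop)),
  delta     = 1 / (n * p * (1 - p)))            # the two should be close
```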

Monte Carlo Simulation: Asymptotic normality of the MLE

set.seed(1234)
R <- 5000
p <- 0.6
n2 <- 1000

prop <- rep(0, R)

for(i in 1:R){
  sample <- rbinom(n2, 1, p)
  prop[i] <- mean(sample)
}


par(mfrow=c(1,2))
hist(prop, freq = F, breaks = 30, main = expression(p))
curve(dnorm(x, mean = p, sd = sqrt((p * (1 - p)/n2))),
      lwd = 2, xlab = "", ylab = "", add = T)

hist(psi(prop), freq = F, breaks = 30, main = expression(psi))
curve(dnorm(x, mean = psi(p), sd = 1/sqrt((p * (1 - p) * n2) )),
      lwd = 2, xlab = "", ylab = "", add = T)

Numerical optimisation

R functions for numerical optimisation

In this example, we computed some quantities analytically and obtained numerical values simply by plugging the data into the formulas. However, the MLE of a parameter may not have an analytic solution: often there is no closed form, and we need numerical optimisation. R provides various functions for this purpose:

  • nlm(): minimizes a function using a Newton-type algorithm. It needs a starting value and does not allow constraints on the parameters. It is usually fast and reliable. It returns the ML estimate \(\hat{\theta}\) (estimate), the minimised value \(-\ell(\hat{\theta})\) (minimum) and the Hessian (hessian), if hessian = TRUE.

  • optim(): minimizes a function using Nelder-Mead, quasi-Newton or conjugate-gradient algorithms. It includes an option for box-constrained optimization, and it requires a starting value. It returns the ML estimate \(\hat{\theta}\) (par), the minimised value \(-\ell(\hat{\theta})\) (value) and the Hessian (hessian), if hessian = TRUE. You can also maximize the function by setting fnscale = -1 in the control argument.

  • nlminb(): is often more stable, robust and reliable than optim() (in particular with “nasty” functions). It performs only minimization and does not return numerical derivatives. It returns the ML estimate \(\hat{\theta}\) (par) and the minimised value \(-\ell(\hat{\theta})\) (objective).

  • optimize(): searches an interval for a minimum, or a maximum (if maximum = TRUE), of a function, using a combination of golden-section search and successive parabolic interpolation. It returns the optimising value (minimum, or maximum) and the function value there (objective). Drawback: suited only to one-dimensional parameters.

Numerical optimisation

Binomial example

Here, we are simply using the Binomial example to show how (some of) these functions work.

Note: In the following, we consider and implement the negative log-likelihood function, since some of the numerical optimisers we will use can only perform minimisation.

In addition, we implement the score function and the observed information (used for comparison with the optimiser output).

nllik_bin <- function(theta, data){
  -sum(data) * log(theta) - (length(data) - sum(data)) * log(1-theta)
}
nllik_bin_rep <- function(param, data) nllik_bin(theta(param), data)
score <- function(theta, data){
  sum(data)/theta - (length(data) - sum(data))/(1 - theta)
}

obs_info <- function(theta, data){
   sum(data)/theta^2 + (length(data) - sum(data))/(1 - theta)^2
}

Numerical optimisation

Binomial example

To check that the score and the observed information are correctly implemented, we can compute the numerical gradient and Hessian of the log-likelihood function, as implemented in the {numDeriv} R package.

Important

Note that in this case we obtain them for the negative log-likelihood function, implying that we need to change the sign of the score function.

library(numDeriv)
grad(nllik_bin, 0.5, data = y)
[1] -52
-score(0.5, data = y)
[1] -52
hessian(nllik_bin, 0.5, data = y)
     [,1]
[1,]  400
obs_info(0.5, data = y)
[1] 400

Numerical optimisation

nlm optimiser

Let’s start with the first optimiser, nlm() (taking 0.5 as starting value, which is quite close to the MLE; however, see the warnings printed on the console)

bin.nlm_start1 <- nlm(f = nllik_bin, p = 0.5,
                       data = y, hessian = TRUE)
bin.nlm_start1
$minimum
[1] 65.89557

$estimate
[1] 0.6299995

$gradient
[1] 1.421085e-08

$hessian
         [,1]
[1,] 429.0957

$code
[1] 1

$iterations
[1] 5
nllik_bin(bin.nlm_start1$estimate, data = y)
[1] 65.89557
obs_info(bin.nlm_start1$estimate, y)
[1] 429

Numerical optimisation

optim optimiser

The previous optimiser does not allow constraints on the parameters; although there are workarounds that may still reach the result, if you want to optimise a function over a constrained parameter space it is better to consider an optimiser allowing for box-constrained optimisation, such as optim().

bin.optim_start1 <- optim(par = 0.5, fn = nllik_bin, hessian = TRUE,
  data = y, method = "L-BFGS-B", lower = 1e-7, upper = 1-1e-7)
bin.optim_start1
$par
[1] 0.6299996

$value
[1] 65.89557

$counts
function gradient 
       7        7 

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

$hessian
         [,1]
[1,] 429.0048
bin.optim_start2 <- optim(par = 0.001, fn = nllik_bin,
  data = y, method = "L-BFGS-B", lower = 1e-7, upper =1-1e-7)
bin.optim_start2
$par
[1] 0.6299996

$value
[1] 65.89557

$counts
function gradient 
      12       12 

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
optimHess(bin.optim_start2$par, nllik_bin, data = y)
         [,1]
[1,] 429.0048

Numerical optimisation

Note

  • The optim() function provides a variety of numerical methods (Nelder-Mead, quasi-Newton, conjugate-gradient methods and simulated annealing). As a drawback, the user has to choose the initial guess and the method carefully, since the final solution may be quite sensitive to these choices. To compute the observed information, the function optimHess() computes numerical derivatives of generic functions.

  • As an alternative, we could consider again the reparametrisation introduced above. Generally, for numerical purposes, it is convenient to reparametrise the model in such a way that the new parameter space is unbounded. We then explore this by using nlm()

Numerical optimisation

bin.nlm_start1_rep <- nlm(f = nllik_bin_rep , p = 0,
                          data = y, hessian = TRUE)
bin.nlm_start1_rep
$minimum
[1] 65.89557

$estimate
[1] 0.5322163

$gradient
[1] 2.842171e-08

$hessian
        [,1]
[1,] 23.3094

$code
[1] 1

$iterations
[1] 4
nllik_bin_rep (bin.nlm_start1_rep$estimate, data = y)
[1] 65.89557
c(bin.nlm_start1_rep$estimate, psi(bin.nlm_start1$estimate), psi(mean(y)))
[1] 0.5322163 0.5322147 0.5322168
hessian(nllik_bin_rep, bin.nlm_start1_rep$estimate, data = y)
      [,1]
[1,] 23.31
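For completeness, the two optimisers described earlier but not yet shown, optimize() and nlminb(), can be tried on the same problem; a self-contained sketch (the simulated sample and names below are illustrative, not the lab's y):

```r
# Sketch: optimize() and nlminb() on the binomial negative log-likelihood
set.seed(42)
y_sim <- rbinom(100, 1, 0.6)                     # illustrative sample
nll <- function(theta, data)                     # negative log-likelihood
  -sum(data) * log(theta) - (length(data) - sum(data)) * log(1 - theta)

# optimize(): interval search, suited to one-dimensional parameters
optimize(nll, interval = c(1e-7, 1 - 1e-7), data = y_sim, tol = 1e-8)$minimum

# nlminb(): minimisation from a starting value, with box constraints
nlminb(start = 0.5, objective = nll, data = y_sim,
       lower = 1e-7, upper = 1 - 1e-7)$par

mean(y_sim)   # both agree with the closed-form MLE
```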

Introduction to regression and diagnostic checks

Simple linear regression

The regression framework

Regression tries to describe a dependent variable \(\mathbf{y}\) through some predictors \(\mathbf{x}\), linking them through a function \(f(\cdot)\) (also referred to as the systematic component): \[\mathbf{y} = f(\mathbf{x})+\boldsymbol{\varepsilon},\] where \(\boldsymbol{\varepsilon}\) is a measure of the noise implicit in modelling (the stochastic component).

Validating model assumptions

Checking the assumptions of a regression model is not just a theoretical matter: it relates to scientific reproducibility, transparency and effective visualization. Although it is impossible to write down an exhaustive list of ‘what to do for checking a model’, we give here some initial steps for visualising and inspecting your variables, as well as for examining possible correlations between them and the inclusion of further predictors. The suggested steps refer to the (normal) classical linear model, \(f(\mathbf{x}) = f(\mathbf{x}; \boldsymbol{\beta}) = \mathbf{X} \boldsymbol{\beta}\), that is

\[y_i=\beta_1x_{i1} + \beta_2x_{i2} + \ldots + \beta_px_{ip} +\varepsilon_i, \quad \varepsilon_i \stackrel{iid}\sim \mathcal{N}(0,\sigma^2),\] where we usually assume \(x_{i1}=1\), but they embody principles that extend to statistical modelling in general.

Diagnostic checks

Before fitting a model

  • Examine the distribution of each explanatory variable, and of the dependent variable. This way, skewness and the presence of outliers can be detected. When a distribution is skewed, consider transformations that induce symmetry.
  • Examine the scatterplots of all pairs of explanatory variables. If some pairwise plot shows non-linearity, consider possible transformations.
  • Examine the range of the variables.
  • Examine the degree of correlation between the explanatory variables.
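These preliminary checks can be sketched in a few lines of R; a minimal illustration on the built-in mtcars data (the dataset and columns here are purely illustrative, not the lab's example):

```r
# Sketch of the "before fitting" checks on a small numeric data frame
d <- mtcars[, c("mpg", "wt", "hp")]
par(mfrow = c(1, 3))
for (v in names(d)) hist(d[[v]], main = v, xlab = v)  # marginal distributions
pairs(d)                # pairwise scatterplots: look for non-linearity
sapply(d, range)        # variable ranges
cor(d)                  # correlations between the variables
```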

After fitting a model

  • Plot residuals against fitted values and check for patterns in the residuals. Also, check for homoscedasticity.
  • Examine the Cook’s distance for possible outliers.

Linear regression in practice

Example

We consider the dataset nihills, available in the {DAAG} package, containing data from the 2007 calendar for the Northern Ireland Mountain Running Association. The number of observations (races) is \(n = 23\), and for each of them we record the following variables:

  • Distances in miles: dist.
  • Amount of climb, in feet: climb.
  • Record time in hours for males: time.
  • Record time in hours for females: timef.

Objective: We want to regress the record times on some predictors.

Linear regression in practice

Loading the data

In the following, we focus on the time results for males. Then, we load the data.

library(DAAG)
data(nihills)
n <- nrow(nihills)

Linear regression in practice

The data structure

We explore the data structure, the first six and the last six rows.

str(nihills)
'data.frame':   23 obs. of  4 variables:
 $ dist : num  7.5 4.2 5.9 6.8 5 4.8 4.3 3 2.5 12 ...
 $ climb: int  1740 1110 1210 3300 1200 950 1600 1500 1500 5080 ...
 $ time : num  0.858 0.467 0.703 1.039 0.541 ...
 $ timef: num  1.064 0.623 0.887 1.214 0.637 ...
head(nihills)
                   dist climb      time     timef
Binevenagh          7.5  1740 0.8583333 1.0644444
Slieve Gullion      4.2  1110 0.4666667 0.6230556
Glenariff Mountain  5.9  1210 0.7030556 0.8869444
Donard & Commedagh  6.8  3300 1.0386111 1.2141667
McVeigh Classic     5.0  1200 0.5411111 0.6375000
Tollymore Mountain  4.8   950 0.4833333 0.5886111
tail(nihills)
                 dist climb      time     timef
Slieve Bearnagh   4.0  2690 0.6877778 0.7991667
Seven Sevens     18.9  8775 3.9027778 5.9855556
Lurig Challenge   4.0  1000 0.4347222 0.5755556
Scrabo Hill Race  2.9   750 0.3247222 0.4091667
Slieve Gallion    4.6  1440 0.6361111 0.7494444
BARF Turkey Trot  5.7  1430 0.7130556 0.9383333

Linear regression in practice

Summary

A summary report allows exploring some characteristics of the variables.

summary(nihills)
      dist            climb           time            timef       
 Min.   : 2.500   Min.   : 750   Min.   :0.3247   Min.   :0.4092  
 1st Qu.: 4.000   1st Qu.:1205   1st Qu.:0.4692   1st Qu.:0.6158  
 Median : 4.500   Median :1500   Median :0.5506   Median :0.7017  
 Mean   : 5.778   Mean   :2098   Mean   :0.8358   Mean   :1.1107  
 3rd Qu.: 5.800   3rd Qu.:2245   3rd Qu.:0.7857   3rd Qu.:1.0014  
 Max.   :18.900   Max.   :8775   Max.   :3.9028   Max.   :5.9856  

Simple Linear Regression

Exploring the relationship

As a first step, we consider only the predictor dist for regressing the dependent variable time. Let us have a quick look at the data by means of a scatterplot:

plot(nihills$dist, nihills$time, pch = 19, xlab = "dist", ylab = "time")

Simple Linear Regression

Fitting the model

Apart from one race, all recorded times are below 2 hours. As an initial attempt, let’s fit a simple linear model (and see a basic model output): \[\texttt{time}_i =\beta_1+ \beta_2\texttt{dist}_i+\varepsilon_i, \ \ i=1,\ldots,n, \quad \varepsilon_i\stackrel{iid}\sim \mathcal{N}(0,\sigma^2).\]

lm_dist <- lm(time ~ dist, data = nihills)
lm_dist

Call:
lm(formula = time ~ dist, data = nihills)

Coefficients:
(Intercept)         dist  
    -0.3253       0.2009  

Simple Linear Regression

Detailed model output

A more detailed summary of the fitting procedure can be obtained by using the summary() function

summary(lm_dist)

Call:
lm(formula = time ~ dist, data = nihills)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42812 -0.12193 -0.00250  0.09232  0.43028 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.32530    0.07629  -4.264 0.000345 ***
dist         0.20094    0.01120  17.934 3.28e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1935 on 21 degrees of freedom
Multiple R-squared:  0.9387,    Adjusted R-squared:  0.9358 
F-statistic: 321.6 on 1 and 21 DF,  p-value: 3.278e-14

Simple Linear Regression

Understanding the summary output

  • Call: The fitted model.
  • Residuals: Some residual summaries, mainly quantiles (asymmetry?).
  • Coefficients:
    • Estimates: coefficient estimates, \(\hat \beta_j, j = 1, \ldots, p\).
    • Std. Error: estimated standard deviation of the estimators \(\hat \beta_j\), that is \(\sqrt{ \widehat {V (\hat \beta_j)}}\).
    • t-value: the (observed) values of the test statistic for testing the nullity of the coefficients, \(H_0: \beta_j=0\) against \(H_1: \beta_j\neq0\), that is \(t_j = \hat \beta_j / \sqrt{ \widehat {V (\hat \beta_j)}}\).
    • \(Pr(>|t|)\): the corresponding p-values, \(2P(T_{n-p} > |t_j|)\).
    • Some symbols denoting where the p-values fall, *** if \(0 < p \leq 0.001\), ** if \(0.001 < p \leq 0.01\), * if \(0.01 < p\leq 0.05\), . if \(0.05 < p \leq 0.1\), otherwise nothing.

Simple Linear Regression

Understanding the summary output

  • Residual standard error: the square root \(s=\sqrt{s^2}\) of the unbiased estimate of the error variance, with the corresponding degrees of freedom \(n-p\).
  • Multiple R-squared: the \(R^2\) coefficient.
  • Adjusted R-squared: the adjusted \(R^2\) coefficient (given by \(R^2 -\frac{p-1}{n-p}(1-R^2)\)).
  • F-statistic: the observed value of the test statistic comparing the full model with the null model (that is, the model including only the intercept), with the corresponding degrees of freedom and the related p-value. Namely, for testing \(H_0: \beta_2= \ldots = \beta_p=0\) against the alternative that at least one coefficient is different from zero.

Simple Linear Regression

Residual plots

Then, we can visualise some aspects of the fitted model. The first step, and likely the most important one, is to explore the residual plots

par(mfrow = c(2, 2))
plot(lm_dist)

Simple Linear Regression

Understanding residual plots

  • (Raw) residuals: \(e_i = y_i - \hat y_i\)
  • Standardised residuals (internally studentized residuals): \(r_i = e_i/(s \sqrt{1-p_{ii}})\), where \(p_{ii}\) is the \(i\)-th diagonal element of the projection matrix (\(P = X (X^\top X)^{-1} X^\top\))

Plots (graphical assessment of certain linear model assumptions):

  • Residuals vs fitted: \((\hat y_i, e_i)\) (linearity and homoscedasticity assumptions).
  • Q-Q residuals: \((\Phi^{-1}((i-0.5)/n), r_{(i)})\) (normality assumption).
  • Scale-Location: \((\hat y_i, \sqrt{|r_i|})\) (homoscedasticity assumption).
  • Residuals vs Leverage: \((p_{ii}, r_i)\), with the inclusion of contour level of \(q^2(D,h)=D p (1-h)/h\), that is the lines \(q = \pm \sqrt{D p (1-h)/h}\), where \(D\) is the Cook’s distance.
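The leverages and standardised residuals above can be reproduced by hand; a sketch on the built-in mtcars data (dataset chosen purely for illustration), comparing with R's rstandard() and hatvalues():

```r
# Sketch: leverages p_ii and standardised residuals computed from scratch
fit <- lm(mpg ~ wt, data = mtcars)
X <- model.matrix(fit)
p_ii <- diag(X %*% solve(t(X) %*% X) %*% t(X))   # diagonal of the projection matrix
s <- summary(fit)$sigma                           # residual standard error
r_manual <- residuals(fit) / (s * sqrt(1 - p_ii))
# agrees with R's built-in internally studentized residuals
max(abs(r_manual - rstandard(fit)))
```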

Simple Linear Regression

Visualizing the fitted model

Then, we can produce the scatterplot including the regression line

Overall, the fit seems good enough. However, the relationship between time and dist does not appear purely linear, and the residual plots suggest a lack of fit for a few points.

par(mfrow = c(1, 1))
with(nihills, plot(dist, time, pch = 19))
abline(coef(lm_dist), col = "red", lty = "solid")
text(13, 3, expression(time == hat(beta)[0] + hat(beta)[1] * dist), col = "red")
points(nihills$dist, predict(lm_dist), col = "red", pch = 19, cex = 0.8)
nihills[17,]
segments(nihills[17,]$dist, nihills[17,]$time, nihills[17,]$dist, fitted(lm_dist)[17], lty = "dashed")

The summary output of a lm object

Computing model quantities by hand

Here, we obtain by hand the quantities in the output of the summary() function applied to an lm object. In the following, the computations are done using matrix algebra (even though we are working with a simple linear model); in this way, the same lines of code can be used to carry out analogous checks with a multiple linear model.

### Save the results of the fitted model summary in a object 
sum_lm_dist <- summary(lm_dist)
### Get the needed quantities 
X <- model.matrix(~ dist, data = nihills)
y <- nihills$time
p <- ncol(X)

The summary output of a lm object

### Estimated coefficients 
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta.hat
                  [,1]
(Intercept) -0.3252964
dist         0.2009417
sum_lm_dist$coefficients[,1] # Check 
(Intercept)        dist 
 -0.3252964   0.2009417 
### Fitted values 
y.hat <- X %*% beta.hat
max(abs(predict(lm_dist) - y.hat)) # Check
[1] 3.108624e-15
### Residuals 
residuals <- y - y.hat 
max(abs(residuals(lm_dist) - residuals)) # Check
[1] 2.720046e-15
cbind(summary(residuals), round(summary(sum_lm_dist$residuals), 5)) # Check
       V1                        
 "Min.   :-0.428118  " "-0.42812"
 "1st Qu.:-0.121926  " "-0.12193"
 "Median :-0.002496  " "-0.0025" 
 "Mean   : 0.000000  " "0"       
 "3rd Qu.: 0.092318  " "0.09232" 
 "Max.   : 0.430276  " "0.43028" 

The summary output of a lm object

### Estimate of error variance using residuals 
s2 <- sum(residuals^2)/(n - p); s2
[1] 0.03744742
sum_lm_dist$sigma^2 # Check 
[1] 0.03744742
### Estimates of stdev of regression coefficient estimator
stdb <- sqrt(diag(s2 * solve(t(X) %*% X))); stdb
(Intercept)        dist 
 0.07628629  0.01120431 
sum_lm_dist$coefficients[, 2] ## Check 
(Intercept)        dist 
 0.07628629  0.01120431 
### Observed test statistics 
beta.hat/stdb
                 [,1]
(Intercept) -4.264153
dist        17.934328
as.numeric(sum_lm_dist$coefficients[, 3])  ## Check 
[1] -4.264153 17.934328
### p-values 
2 * pt(abs(beta.hat/stdb), df = n - p, lower.tail = FALSE)
                    [,1]
(Intercept) 3.454758e-04
dist        3.278480e-14
as.numeric(sum_lm_dist$coefficients[,4]) ## Check
[1] 3.454758e-04 3.278480e-14

The summary output of a lm object

### r-squared 
Tot_SS <- sum((y - mean(y))^2) 
Res_SS <- sum((y.hat - y)^2)
Mod_SS <- sum((y.hat - mean(y))^2)
R_sq <- Mod_SS/Tot_SS; R_sq
[1] 0.9387112
with(nihills, cor(time, dist))^2
[1] 0.9387112
sum_lm_dist$r.squared  # Check 
[1] 0.9387112
### adjusted r-squared
1 - (Res_SS/(n - p))/(Tot_SS/(n - 1))
[1] 0.9357927
R_sq - (1 - R_sq)*(p - 1)/(n - p)     # Check 
[1] 0.9357927
sum_lm_dist$adj.r.squared # Check
[1] 0.9357927

The summary output of a lm object

### Null model 
X0 <- X[, 1]
p0 <- 1
beta.hat0 <- solve(t(X0) %*% X0) %*% t(X0) %*% y
beta.hat0
          [,1]
[1,] 0.8357971
beta.hat0 <- mean(y)

y.hat0 <- mean(y)
residuals0 <- y - y.hat0 

### F-statistic 
F <- ((sum(residuals0^2) - sum(residuals^2))/(p - p0))/(sum(residuals^2)/(n - p))
F
[1] 321.6401
sum_lm_dist$coefficients[2, 3]^2   # Check
[1] 321.6401
sum_lm_dist$fstatistic # Check
   value    numdf    dendf 
321.6401   1.0000  21.0000 
### p-value
pf(F, p0, n - p, lower.tail = FALSE)
[1] 3.27848e-14